Comparative Performance Analysis of Intel Xeon Phi, GPU, and CPU
Abstract
We study and characterize the performance of operations in an important class of applications on GPUs and Many Integrated Core (MIC) architectures. Our work is motivated by applications that analyze low-dimensional spatial datasets captured by high-resolution sensors, such as image datasets obtained from whole slide tissue specimens using microscopy scanners. Common operations in these applications involve the detection and extraction of objects (object segmentation), the computation of features of each extracted object (feature computation), and the characterization of objects based on these features (object classification). In this work, we identify the data access and computation patterns of operations in the object segmentation and feature computation categories. We systematically implement and evaluate the performance of these operations on modern CPUs, GPUs, and MIC systems for a microscopy image analysis application. Our results show that, for operations that perform regular data access, performance on a MIC is comparable to, and sometimes better than, that on a GPU. On the other hand, GPUs are significantly more efficient than MICs for operations that access data irregularly, because MICs perform poorly on random data access. We have also examined the coordinated use of MICs and CPUs. Our experiments show that using a performance-aware task scheduling strategy for application operations improves performance by about 1.29× over a first-come-first-served strategy. This allows applications to obtain high performance efficiency on CPU-MIC systems: the example application attained an efficiency of 84% on 192 nodes (3072 CPU cores and 192 MICs).
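The contrast between first-come-first-served and performance-aware scheduling can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the scheduler used in the paper: it assumes one CPU worker and one MIC worker, and each task carries a hypothetical CPU run time and a hypothetical estimated MIC speedup (the numbers below are invented for illustration). Under the performance-aware policy, an idle MIC picks the queued task with the highest estimated speedup and an idle CPU picks the one with the lowest.

```python
def simulate(tasks, policy):
    """Return the makespan for one CPU worker and one MIC worker.

    tasks  : list of (cpu_time, mic_speedup) pairs (hypothetical estimates)
    policy : "fcfs" or "aware"
    """
    queue = list(tasks)
    free_at = {"cpu": 0.0, "mic": 0.0}  # time at which each device becomes idle
    while queue:
        # The device that becomes idle first requests the next task.
        dev = min(free_at, key=free_at.get)
        if policy == "fcfs":
            idx = 0  # always take the oldest queued task
        elif dev == "mic":
            # MIC takes the task it accelerates the most.
            idx = max(range(len(queue)), key=lambda i: queue[i][1])
        else:
            # CPU takes the task the MIC accelerates the least.
            idx = min(range(len(queue)), key=lambda i: queue[i][1])
        cpu_time, speedup = queue.pop(idx)
        free_at[dev] += cpu_time if dev == "cpu" else cpu_time / speedup
    return max(free_at.values())


# Invented workload: two tasks the MIC runs 8x faster (regular access)
# and two it does not accelerate at all (irregular access).
tasks = [(8.0, 8.0), (8.0, 1.0), (8.0, 8.0), (8.0, 1.0)]
fcfs_makespan = simulate(tasks, "fcfs")
aware_makespan = simulate(tasks, "aware")
print(fcfs_makespan, aware_makespan)
```

In this invented workload the performance-aware policy finishes sooner because the MIC is kept busy with the tasks it accelerates well. The size of the gain depends entirely on the assumed speedups and says nothing about the 1.29× figure reported above.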
Similar resources
Performance Analysis of an Astrophysical Simulation Code on the Intel Xeon Phi Architecture
We have developed the astrophysical simulation code XFLAT to study neutrino oscillations in supernovae. XFLAT is designed to utilize multiple levels of parallelism through MPI, OpenMP, and SIMD instructions (vectorization). It can run on both CPU and Xeon Phi co-processors based on the Intel Many Integrated Core Architecture (MIC). We analyze the performance of XFLAT on configurations with CPU ...
First Evaluation of the CPU, GPGPU and MIC Architectures for Real Time Particle Tracking based on Hough Transform at the LHC
Recent innovations focused around parallel processing, either through systems containing multiple processors or processors containing multiple cores, hold great promise for enhancing the performance of the trigger at the LHC and extending its physics program. The flexibility of the CMS/ATLAS trigger system allows for easy integration of computational accelerators, such as NVIDIA’s Tesla Graphic...
Fast computation of computer-generated hologram using Xeon Phi coprocessor
We report fast computation of computer-generated holograms (CGHs) using Xeon Phi coprocessors, recently released by Intel, which integrate many x86-based processor cores on one chip. CGHs can generate arbitrary light wavefronts and are therefore a promising technology for many applications: for example, three-dimensional displays, diffractive optical elements, and the generation of arbitrary beams. ...
Evaluation of the Intel Xeon Phi 7120 and NVIDIA K80 as accelerators for two-dimensional panel codes
To optimize the geometry of airfoils for a specific application is an important engineering problem. In this context genetic algorithms have enjoyed some success as they are able to explore the search space without getting stuck in local optima. However, these algorithms require the computation of aerodynamic properties for a significant number of airfoil geometries. Consequently, for low-speed...
Is Altera’s OpenCL SDK ready for business?
Software engineers can now add FPGAs to their HPC arsenal with Altera’s OpenCL SDK along with compatible FPGA platforms from vendors such as Nallatech. We have used a case study of challenging problems from Computational Finance to verify this claim. Our OpenCL FPGA implementations achieved performance 256 times faster than a sequential CPU implementation, which is better than the 240 times spe...
Regularizing graph centrality computations
Centrality metrics such as betweenness and closeness have been used to identify important nodes in a network. However, it takes days to months on a high-end workstation to compute the centrality of today’s networks. The main reasons are the size and the irregular structure of these networks. While today’s computing units excel at processing dense and regular data, their performance is questiona...
Journal title:
- IEEE transactions on parallel and distributed systems : a publication of the IEEE Computer Society
Volume 2014, Issue
Pages -
Publication date: 2014